Compression of DNA sequence reads in FASTQ format

Authors

  • Sebastian Deorowicz
  • Szymon Grabowski
Abstract

MOTIVATION: Modern sequencing instruments are able to generate at least hundreds of millions of short reads of genomic data. Such huge volumes of data require effective means of storage that also provide quick access to any record and enable fast decompression.

RESULTS: We present a specialized compression algorithm for genomic data in FASTQ format which dominates its competitor, G-SQZ, as shown on a number of datasets from the 1000 Genomes Project (www.1000genomes.org).

AVAILABILITY: DSRC is freely available at http://sun.aei.polsl.pl/dsrc.

Similar resources

BEETL-fastq: a searchable compressed archive for DNA reads

MOTIVATION: FASTQ is a standard file format for DNA sequencing data, which stores both nucleotides and quality scores. A typical sequencing study can easily generate hundreds of gigabytes of FASTQ files, while public archives such as ENA and NCBI and large international collaborations such as the Cancer Genome Atlas can accumulate many terabytes of data in this format. Compression tools such as ...


CAN-zip – Centroid Based Delta Compression of Next Generation Sequencing Data

We present CANzip, a novel algorithm for compressing short-read DNA sequencing data in FastQ format. CANzip is based on delta compression, a process in which only the differences of a specific data stream relative to a given reference stream are stored. However, CANzip uniquely assumes no given reference stream. Instead it creates artificial references for different clusters of reads, by constru...


Disk-based genome sequencing data compression

Motivation: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk-based (Yanovsky, 2011; Cox et al., 2012), where the be...


Disk-based compression of data from genome sequencing

MOTIVATION: High-coverage sequencing data have significant, yet hard to exploit, redundancy. Most FASTQ compressors cannot efficiently compress the DNA stream of large datasets, since the redundancy between overlapping reads cannot be easily captured in the (relatively small) main memory. More interesting solutions for this problem are disk based, where the better of these two, from Cox et al. (...


Tagmentation-Based Mapping (TagMap) of Mobile DNA Genomic Insertion Sites

Multiple methods have been introduced over the past 30 years to identify the genomic insertion sites of transposable elements and other DNA elements that integrate into genomes. However, each of these methods suffers from limitations that can frustrate attempts to map multiple insertions in a single genome and to map insertions in genomes of high complexity that contain extensive repetitive DNA....


Journal:
  • Bioinformatics

Volume 27, Issue 6

Pages: -

Publication date: 2011